Abstractive Text Summarization (2)

  • Mohammed Zaieda
  • Omar Ali

Problem Statement

Abstractive text summarization is the task of generating a short, concise summary that captures the important ideas and phrases of a text, rather than simply copying sentences out of it. Text summarization has many practical applications in production, including summarizing medical reports for faster analysis and medical decisions, media monitoring and surveillance tracking, and data refinement and memory reduction.

Dataset

There are multiple datasets suitable for abstractive text summarization. The most relevant ones are:

  • CNN/Daily Mail
  • WikiHow
  • XSum
  • Amazon Fine Food Reviews

The dataset we chose is XSum, which consists of 226,711 news articles, each accompanied by a one-sentence summary. The articles were collected from the BBC between 2010 and 2017. We use XSum for this project because it offers better support for transfer learning.



Input/Output Examples

Below are two examples of the desired input and output, where the output is a summarized version of the input.
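As a concrete illustration, the short sketch below loads XSum and prints two article/summary pairs of the kind intended here. It assumes the Hugging Face datasets package and its hosted "xsum" dataset, which are conveniences introduced here for illustration rather than part of the original setup.

    # Sketch: print two input/output examples from XSum.
    # Assumes the Hugging Face "datasets" package and the hosted "xsum" dataset.
    from datasets import load_dataset

    xsum = load_dataset("xsum", split="train")

    for i in range(2):
        sample = xsum[i]  # fields: "document", "summary", "id"
        print("INPUT (article): ", sample["document"][:300], "...")
        print("OUTPUT (summary):", sample["summary"])
        print("-" * 80)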



State of the art



Original Model from Literature

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
PEGASUS masks multiple whole sentences rather than shorter contiguous text spans.
  • Masking strategies:
    • Masking random sentences.
    • Masking the first m sentences, also known as Lead.
    • Masking the top m sentences scored by their importance using ROUGE-1 (a rough sketch of this selection follows this list).
  • Uses a Transformer encoder to encode the document with the selected sentences masked, and a Transformer decoder to generate the masked sentences as the target text.
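To make the ROUGE-1-based selection concrete, here is a rough, self-contained sketch of gap-sentence masking. The naive sentence splitting, the simple unigram ROUGE-1 implementation, and the [MASK1] token string are our simplifications for illustration, not the paper's exact procedure.

    # Sketch of PEGASUS-style gap-sentence generation (GSG) pre-processing:
    # score each sentence with ROUGE-1 F1 against the rest of the document,
    # mask the top-m sentences in the input, and use them as the decoder target.
    from collections import Counter

    MASK = "[MASK1]"  # assumed mask token string, for illustration only

    def rouge1_f1(candidate, reference):
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    def gap_sentence_mask(document, m=1):
        # Naive sentence splitting on "." purely for illustration.
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        scored = []
        for i, sent in enumerate(sentences):
            rest = " ".join(sentences[:i] + sentences[i + 1:])
            scored.append((rouge1_f1(sent, rest), i))
        selected = {i for _, i in sorted(scored, reverse=True)[:m]}
        masked_input = " ".join(MASK if i in selected else s
                                for i, s in enumerate(sentences))
        target = " ".join(sentences[i] for i in sorted(selected))
        return masked_input, target

    doc = ("The council approved the new bridge on Tuesday. "
           "Construction is expected to take two years. "
           "Residents welcomed the decision.")
    print(gap_sentence_mask(doc, m=1))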



Proposed Updates

Update #1: Tried a different word embedding algorithm: Word2Vec

The word2vec algorithm learns word associations from a large corpus of text using a neural network model. Once trained, the model can detect synonyms and suggest additional words for a sentence. As the name suggests, word2vec associates each distinct word with a specific vector, and the semantic similarity between words can then be approximated by mathematical functions applied to their vectors, such as cosine similarity.
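A minimal sketch of how such embeddings can be trained with the gensim library is shown below; the toy corpus and the hyperparameter values are illustrative choices, not the settings used in our experiments.

    # Sketch: train word2vec embeddings with gensim and query similarities.
    from gensim.models import Word2Vec

    corpus = [  # toy corpus; in our setting this would be the tokenized XSum articles
        ["the", "minister", "announced", "a", "new", "health", "policy"],
        ["the", "government", "announced", "new", "school", "funding"],
        ["doctors", "welcomed", "the", "health", "policy"],
    ]

    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

    vector = model.wv["health"]                        # the vector associated with a word
    print(model.wv.similarity("health", "policy"))     # cosine similarity between two words
    print(model.wv.most_similar("announced", topn=3))  # nearest neighbours in vector space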



Update #2: Performed data augmentation: Back Translation

Data augmentation is typically used to reduce overfitting by generating additional training data through transformations of the original training data. Back translation is one of the data augmentation techniques used for NLP tasks: it involves translating some samples from the dataset into another language (e.g. from English to French) and then translating them back into the original language, which usually produces a paraphrase of the original sentence.
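The sketch below shows one way to implement back translation with the Hugging Face MarianMT checkpoints Helsinki-NLP/opus-mt-en-fr and Helsinki-NLP/opus-mt-fr-en; the choice of checkpoints and the toy sentence are assumptions made for illustration.

    # Sketch: back translation (English -> French -> English) with MarianMT.
    from transformers import MarianMTModel, MarianTokenizer

    def translate(texts, model_name):
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    def back_translate(texts):
        french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
        return translate(french, "Helsinki-NLP/opus-mt-fr-en")

    original = ["The city council approved the construction of a new bridge."]
    print(back_translate(original))  # usually a paraphrase of the original sentence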



Results

The paper reports a ROUGE-2 score of 24.56, whereas our run of the pretrained transformer reached 1.934; the gap is mainly due to reducing the size of the training dataset because of resource limitations. The first table shows the word2vec modification, which, as can be seen, scores considerably lower, while the back-translation (data augmentation) results are very similar to those of the unmodified model.
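For reference, the sketch below shows how such ROUGE-2 scores can be computed, assuming the publicly available google/pegasus-xsum checkpoint and the rouge-score package; the toy article and reference summary are illustrative only and are not taken from our experiments.

    # Sketch: generate a summary with a pretrained PEGASUS checkpoint and
    # score it with ROUGE-2 against a reference summary.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    from rouge_score import rouge_scorer

    model_name = "google/pegasus-xsum"
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    article = ("The council approved the new bridge on Tuesday. "
               "Construction is expected to take two years.")
    reference = "A new bridge has been given the go-ahead by the council."  # toy reference

    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    summary_ids = model.generate(**inputs, num_beams=4, max_length=64)
    prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    print(scorer.score(reference, prediction)["rouge2"].fmeasure)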






Technical report

  • Programming framework
    • PyTorch
  • Training hardware
    • Google Colab
    • Google Cloud
  • Training time
    • Max. 8 hours (on Google Cloud resources)
  • Number of epochs
    • 20 to 50 epochs
  • Time per epoch
    • Max. 25 min approximately (on Google Cloud resources)

Conclusion

In conclusion, we believe that with sufficient resources, more interesting results could have been obtained. For future work, we will keep the model training for longer and perform hyperparameter tuning for the word embedding. We may also experiment with a different type of dataset.
